Abstract

This paper presents a methodology for automatic segmentation and recognition of continuous human activity. We segment a continuous human activity into separate actions and correctly identify each action. The camera views the subject from the lateral view. There are no distinct breaks or pauses between the execution of different actions. We have no prior knowledge about the commencement or termination of each action. We compute the angles subtended by three major components of the body with the vertical axis, namely the torso, the upper component of the leg and the lower component of the leg. Using these three angles as a feature vector, we classify frames into breakpoint and non-breakpoint frames. Breakpoints indicate an action's commencement or termination. We use single action sequences for the training data set. The test sequences, on the other hand, are continuous sequences of human activity that consist of three or more actions in succession. The system has been tested on continuous activity sequences containing actions such as walking, sitting down, standing up, bending, getting up, squatting and rising. It detects the breakpoints and classifies the actions between them.

1  Introduction 

Human activity is a continuous flow of single or discrete human action primitives in succession. An example of a human activity is a sequence of actions in which a subject enters a room, sits down, then stands up, walks forward, bends down to pick up something, and then gets up and walks away. Each component of the human activity, such as walking, sitting down, standing up, bending down and getting up, is a discrete action primitive. Methodology for automatic interpretation of such continuous activity is presented in this paper. When humans move from one action to another, they do so smoothly; transitions between actions are not clearly defined. In general, there is no clear beginning or end of an action. Therefore, to recognize a continuous activity sequence such as the one described above, the detection of transitions between actions is crucial. Most human activity recognition systems require the input sequence to be a single action sequence. In other cases, systems recognize the poses associated with an action rather than a complete action. This enables them to recognize actions but not necessarily to give an accurate temporal description of each action. In this paper, we analyze continuous human activity by first automatically segmenting the activity into discrete actions. Human body motion is actually the movement of the body's parts or components, such as the torso or the upper and the lower limbs. We find that these components, the torso and upper and lower limbs, are the most informative in identifying 'breakpoints' between the actions that we aim to segment and recognize. The system gives an output of all actions that have taken place during the course of a sequence and their individual time intervals in terms of time frames.

*This research was supported in part by the Army Research Office under contracts DAAH04-95-1-0494 and DAAG55-98-1-0230, and by the Texas Higher Education Coordinating Board, Advanced Research Project 97-ARP-275.

2  Review  of  Previous  Work 

Most research in the area of human activity recognition has dealt with the recognition of discrete action primitives. Segmentation and classification of continuous actions is virtually unexplored. The system presented by Madabhushi and Aggarwal [7] classifies twelve different classes of actions. These actions are walking, sitting, standing up, bending, getting up, bending sideways, falling, squatting, rising and hugging in the frontal or lateral views. They track the movement of the head over successive frames and model their system using the difference in the coordinates of the head. They achieved a recognition rate of 83 percent. However, each test sequence was a discrete action primitive. Bobick and Davis [4] used temporal templates for the representation and recognition of human actions. They classified 18 aerobic exercises in 7 different orientations. Once again, the approach was to recognize discrete actions, which in their case were aerobic exercises. Ayers and Shah [2] presented a context-based action recognition system capable of determining the actions taking place in a room. It recognized actions such as walking into a room, opening a cabinet, picking up a telephone and using a computer terminal. Although their system gave a rundown of all of the actions taking place in a room, it was not able to detect exactly when a subject proceeds from one action to the next. Campbell and Bobick [3] presented a system to recognize continuous actions in a limited context. Their system recognized nine fundamental ballet steps from three-dimensional point data. They were thus not using video information but 3D data as input. They used a set of anatomical constraints to model each step. They were not detecting transitions between steps; whenever a certain set of constraints was observed somewhere during the course of a sequence, the system labelled the corresponding step. A more recent related work is the one by Rui and Anandan [10]. In this work, the authors segment visual action sequences based on detecting temporal discontinuities in spatial motion patterns. They extract the frame-by-frame optical flow and, using singular value decomposition, detect discontinuities in trajectories. These discontinuities are keypose frames that are the boundaries of actions.


3  Preprocessing steps: Segmentation and skeletonization

The accurate segmentation of the subject in each frame of the sequence is critical to the skeletonization process, which is sensitive to boundary and internal discontinuities. Background subtraction is used to segment the subject from the scene. This is followed by thresholding, which yields a binary image. The resulting image is further processed using morphological operations such as dilation, erosion and connected component labeling. Fig. 1 shows the segmentation result for one frame of a sequence.
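As an illustration only, the first stages of this preprocessing chain might be sketched in Python as follows. The sketch assumes grayscale images stored as nested lists, a hypothetical threshold of 30 grey levels, and a 3x3 structuring element; none of these values come from the paper, and connected component labeling is omitted for brevity.

```python
def subtract_and_threshold(frame, background, threshold=30):
    """Background subtraction followed by thresholding.

    Returns a binary mask (1 = subject, 0 = background).
    The threshold value is an assumption, not taken from the paper.
    """
    return [[1 if abs(f - b) > threshold else 0
             for f, b in zip(frame_row, background_row)]
            for frame_row, background_row in zip(frame, background)]


def _apply_3x3(mask, combine):
    """Apply `combine` (max or min) over each pixel's in-bounds 3x3 neighbourhood."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [mask[rr][cc]
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = combine(neighbours)
    return out


def dilate(mask):
    """Binary dilation: a pixel is set if any neighbour is set."""
    return _apply_3x3(mask, max)


def erode(mask):
    """Binary erosion: a pixel stays set only if all neighbours are set."""
    return _apply_3x3(mask, min)
```

Applying `erode(dilate(mask))` (a morphological closing) fills small internal holes in the silhouette, which matters here because the skeletonization is sensitive to internal discontinuities.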


Figure 1: (a) background image, (b) one frame of a sequence, (c) thresholded image, (d) final segmented image


Human body motion is the coordinated movement of different body parts and the connected joints. We believe that knowledge of limb and joint angles is useful in detecting the termination and commencement of different actions. A number of studies have used information from the movement of body parts such as the trunk, arms and legs to analyze human motion. Rohr [9] described human walking with joint angles of the hip, knee, shoulder and elbow. In the same vein, Bharatkumar et al. [1] used kinesiology data as the basis for their human walking model. Fujiyoshi and Lipton [5] used the angle of inclination of the torso as a cue to the recognition of walking and running. Niyogi and Adelson [8] exploited the repetitive information of the lower limb trajectory for recognition of human walking.

In this paper we present an algorithm that uses the angle of inclination of three major body components to classify frames into breakpoint and non-breakpoint frames. We then classify the frames between the breakpoints into one of the actions present in the database. The organization of this paper is as follows. In section 3 we present the preprocessing steps applied to the images in order to extract the features. Section 4 explains the algorithm for action segmentation, with a detailed description of each step and some examples. Section 5 describes the module for discrete action recognition, enumerating the features used for classification of the individual actions. System implementation and results are presented in section 6. Conclusions and future directions are outlined in section 7.


Skeletonization has been used extensively in human motion analysis to extract a skeletonized image of the human subject or to generate stick figure models. Bharatkumar et al. [1] used the medial axial transformation to extract stick figures and compared the two-dimensional information obtained from stick figures with that obtained from anthropometric data. Guo et al. [6] also used skeletonization on the extracted human silhouette to yield stick figure models. Since we are working with actions in the lateral view, skeletonization can be used to obtain the three main components of the body used in our algorithm, namely, the torso, the upper component of the legs and the lower component of the legs. Prior work has used a priori information about the position of the hip and the knee joints.


Figure 2: Configuration of points on the skeleton curve

Figure 3: Computation of the angle of curvature $\alpha$

The hip and knee regions are detected by estimating the highest points of curvature on the skeleton. Regions of high curvature will in turn subtend small angles along the curve. The skeleton is represented by a sequence of points $p_i$ in the image plane. For each point $p_i$ along the skeleton curve, we compute the opening angle $\alpha$ using the following formula

$$\alpha = \arccos\left(\frac{a^2 + b^2 - c^2}{2ab}\right) \qquad (1)$$

where $a$, $b$ and $c$ are distances computed as $a = |p - p^+|$, $b = |p - p^-|$ and $c = |p^+ - p^-|$. $p^+$ and $p^-$ are points on the skeleton curve that are on either side of $p$ (Fig. 2 and Fig. 3) and within a specified pixel range $\delta$. $\delta$ is set to 1/10th the height of the skeleton in the present frame. This is to take into account the change in the height of the skeleton as the subject performs different actions. Thus for every point $p$ along the curve, we find the angle subtended at that point by using two other points that are above and below $p$ at a pixel distance $\delta$. Note that $\delta$ is measured in pixels and $a$, $b$ and $c$ are absolute distances between the points. We then find points that subtend minimal angles along the curve. These are regions of high curvature. The hip is the first point of high curvature. The knee is then the point of high curvature that occurs below the hip. Once the hip and the knee regions are detected, the angles are computed with respect to the vertical axis passing through the hip and the knee.

Figure 4: Skeletonization result of image frame in Fig. 1.

Figure 5: Schematic representation of the body components and associated angles

Fig. 4 shows the three components of the skeleton for the sequence frame depicted in Fig. 1. Fig. 5 gives the corresponding schematic representation of the body components and associated angles.
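The opening-angle computation of Eq. (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the skeleton is assumed to be an ordered list of (x, y) pixel coordinates running from head to foot with roughly one point per pixel, so that the pixel range is approximated by an index offset, and the sign convention of `inclination_from_vertical` is a hypothetical choice.

```python
import math


def opening_angle(p, p_plus, p_minus):
    """Opening angle at p (radians) from the law of cosines, as in Eq. (1)."""
    a = math.dist(p, p_plus)
    b = math.dist(p, p_minus)
    c = math.dist(p_plus, p_minus)
    cos_alpha = (a * a + b * b - c * c) / (2 * a * b)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cos_alpha)))


def curvature_angles(skeleton):
    """Map each interior skeleton index to its opening angle.

    delta is one tenth of the skeleton height in the current frame;
    using it as an index offset is an assumption of this sketch.
    """
    ys = [y for _, y in skeleton]
    delta = max(1, round((max(ys) - min(ys)) / 10))
    return {i: opening_angle(skeleton[i], skeleton[i - delta], skeleton[i + delta])
            for i in range(delta, len(skeleton) - delta)}


def inclination_from_vertical(top, bottom):
    """Angle (degrees) between the segment top->bottom and the vertical axis.

    Image coordinates are assumed, with y increasing downwards.
    """
    return math.degrees(math.atan2(bottom[0] - top[0], bottom[1] - top[1]))
```

Points with the smallest opening angles are the candidates for the hip and knee; a straight stretch of skeleton gives an angle close to 180 degrees.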


4  Algorithm  for  action  segmentation 

The algorithm for action segmentation uses the angle of inclination of the torso, denoted by $\theta_t$, the angle of inclination of the upper component of the legs, $\theta_u$, and the angle of inclination of the lower component of the legs, $\theta_l$. These three angles form a feature vector $\{\theta_t, \theta_u, \theta_l\}$. The steps of the algorithm are given below.

4.1  Computation  of  component  angles 

For each frame of the test sequence, the algorithm computes the three angles of inclination of the body components. During a continuous activity sequence, $\theta_t$, $\theta_u$ and $\theta_l$ traverse a series of maxima and minima. Fig. 6 plots $\theta_t$, $\theta_u$ and $\theta_l$ respectively for the sample test sequence illustrated in Fig. 13.

4.2  Classification of frames into breakpoint frames

Figure 6: Angle of inclination of the torso, the upper component and the lower component for the sample test sequence.

Each frame of a continuous sequence is represented by a feature vector, $\{\theta_t, \theta_u, \theta_l\}$. We define two classes, a breakpoint class and a non-breakpoint class. Training vectors for both classes are chosen from continuous sequences. Frames of a continuous sequence in which a person is at the commencement or the termination of an action are chosen to form the breakpoint class. Each sample of this class is represented by a three-element feature vector $\{\theta_{tb}, \theta_{ub}, \theta_{lb}\}$. Fig. 7 shows some of the training sample frames that have been used for the breakpoint class. 16 sample frames were used in the training.

Figure 7: Samples of breakpoint training frames

Frames in which the subject is in the middle of executing an action are chosen to form the non-breakpoint class. Each sample of this class is likewise represented by a three-element feature vector. Fig. 8 shows some of the training sample frames that were used for the non-breakpoint class. 28 sample frames were used to train the non-breakpoint class.

Figure 8: Samples of non-breakpoint training frames for different sequences

The three-element test feature vector $\{\theta_t, \theta_u, \theta_l\}$ is compared with the training feature vectors from each class. The algorithm computes two sets of Euclidean distance measures, $D_b$ and $D_r$, between the feature vector for each frame of a test sequence and the training feature vectors of the breakpoint and non-breakpoint classes respectively. A frame is classified as a breakpoint frame if the minimum of the distance set $D_b$ is less than the minimum of the distance set $D_r$. We are basically looking for combinations of component angles that characterize a breakpoint between actions. This becomes clearer if we look at the graphs in Fig. 6, which indicate the frames that were detected as breakpoint frames for the sequence in Fig. 13. We observe that breakpoints have been detected at frames that are transitions between actions.
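The frame classification rule above amounts to a nearest-neighbor comparison between the breakpoint and non-breakpoint distance measures. A minimal sketch, assuming each feature vector is a tuple of three angles in degrees and that the training vectors are supplied as plain lists (the function names are hypothetical), could read:

```python
import math


def min_distance(feature, training_vectors):
    """Smallest Euclidean distance from `feature` to any training vector."""
    return min(math.dist(feature, t) for t in training_vectors)


def classify_frame(feature, breakpoint_train, non_breakpoint_train):
    """Return (is_breakpoint, d_b) for one frame.

    d_b is the minimum distance to the breakpoint class; it is returned
    so that competing breakpoint candidates can be ranked later.
    """
    d_b = min_distance(feature, breakpoint_train)
    d_r = min_distance(feature, non_breakpoint_train)
    return d_b < d_r, d_b
```

Returning the breakpoint distance alongside the decision is a design choice of this sketch; it keeps the information needed to resolve several neighbouring frames that are all classified as breakpoints.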

4.3  Segmenting individual actions from a continuous sequence

Frames that lie between breakpoint frames are segmented as individual actions. The algorithm initiates a counter every time a breakpoint frame is detected. The algorithm then keeps track of frames and looks for the next breakpoint frame. Once the next breakpoint is detected, frames between the two breakpoints are classified as an individual action. Sometimes more than one frame in the vicinity of the breakpoint frame can get classified as a breakpoint frame. In this case, we pick the breakpoint which yields the smallest value of $D_b$.
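The segmentation step can be illustrated with a short sketch. It assumes that "in the vicinity" means a run of consecutive frames flagged as breakpoints, which is an interpretation made here rather than a definition given in the text; the function names are likewise hypothetical.

```python
def select_breakpoints(flags, d_b):
    """Keep one breakpoint per run of consecutive flagged frames.

    flags[i] is True if frame i was classified as a breakpoint and
    d_b[i] is its minimum distance to the breakpoint class. Within
    each run, the frame with the smallest d_b is retained.
    """
    chosen, run = [], []
    for i, flagged in enumerate(flags):
        if flagged:
            run.append(i)
        elif run:
            chosen.append(min(run, key=lambda j: d_b[j]))
            run = []
    if run:  # close a run that reaches the end of the sequence
        chosen.append(min(run, key=lambda j: d_b[j]))
    return chosen


def segment_actions(breakpoints):
    """Pair successive breakpoints into (start_frame, end_frame) intervals."""
    return list(zip(breakpoints, breakpoints[1:]))
```

For example, with flags `[True, False, False, True, True, False, False, True]` and the second flagged run having distances 0.9 and 0.2, frame 4 is kept over frame 3, giving the intervals (0, 4) and (4, 7).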

5  Discrete  action  recognition 

In addition to breakpoint detection, we use the angle of inclination of the torso and the upper and lower components to classify the different actions. We observe that these angles traverse a characteristic path during the execution of each action. The skeletons for sitting down and squatting are shown in Fig. 9 and Fig. 10 respectively.

Figure 9: Frames of a sitting down action and its skeleton showing the components

Figure 10: Frames of a squatting action and its skeleton showing the components

The feature vectors can be given as follows:

$$A_1 = [\theta_{t1}, \theta_{t2}, \ldots, \theta_{tn}] \qquad (2)$$

$$A_2 = [\theta_{u1}, \theta_{u2}, \ldots, \theta_{un}] \qquad (3)$$

$$A_3 = [\theta_{l1}, \theta_{l2}, \ldots, \theta_{ln}] \qquad (4)$$

where $A_1$, $A_2$, $A_3$ are the normalized vectors for the angle of inclination of the torso, the upper component and the lower component of the leg respectively. The angles are normalized for each action by dividing them by the maximum of the absolute values for that angle. The system has been trained on complete discrete action sequences that last for ten frames. If the test action yields a feature vector with fewer elements than the training vectors, it is interpolated to the size of the training vector. Similarly, if the test vector is longer than the training vector, then all the training vectors are interpolated to the length of the test vector. All three feature vectors are interpolated using cubic spline interpolation.

The nearest neighbor classifier assigns the feature vector $\{A_1, A_2, A_3\}$ to the same class $\omega$ (where $\omega \in \{1, 2, \ldots, 7\}$) as the training feature vectors nearest to it in the feature space. The test sequence is assigned to the training class that yields the least sum as computed in Eq. 5.

$$\min_{\omega} \left\{ \sum_{k=1}^{n} (\theta_{tk} - \theta_{t\omega k})^2 + \sum_{k=1}^{n} (\theta_{uk} - \theta_{u\omega k})^2 + \sum_{k=1}^{n} (\theta_{lk} - \theta_{l\omega k})^2 \right\} \qquad (5)$$
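A compact sketch of this classification step is given below. Two points are assumptions of the sketch rather than details from the paper: linear interpolation is substituted for the cubic spline interpolation to keep the example dependency-free, and the training set is represented as a list of hypothetical `(label, (torso, upper, lower))` pairs.

```python
def normalize(seq):
    """Divide by the maximum absolute value of the sequence (if non-zero)."""
    peak = max(abs(v) for v in seq)
    return [v / peak for v in seq] if peak else list(seq)


def resample(seq, n):
    """Resample seq to n samples (linear stand-in for the cubic spline)."""
    if len(seq) == n:
        return list(seq)
    if len(seq) == 1 or n == 1:
        return [seq[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(seq) - 1) / (n - 1)
        i = min(int(pos), len(seq) - 2)
        frac = pos - i
        out.append(seq[i] * (1 - frac) + seq[i + 1] * frac)
    return out


def eq5_cost(test, train):
    """Sum of squared differences over the three angle sequences (Eq. 5)."""
    n = max(len(test[0]), len(train[0]))  # the shorter side is stretched
    return sum((x - y) ** 2
               for test_seq, train_seq in zip(test, train)
               for x, y in zip(resample(test_seq, n), resample(train_seq, n)))


def classify_action(test, training_set):
    """Return the label of the training action with the least Eq. 5 cost."""
    test_norm = [normalize(s) for s in test]
    best = min(training_set,
               key=lambda item: eq5_cost(test_norm, [normalize(s) for s in item[1]]))
    return best[0]
```

With a spline library available, `resample` would be the natural place to swap in cubic spline interpolation without changing the rest of the sketch.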

6  System Implementation and Results

Image sequences are obtained using a fixed CCD camera working at 12-15 frames per second. The test sequences range in size from 60 to 80 frames. Figure 11 illustrates the different system modules. The system segments the body of the subject. After the body segmentation, the resulting image is skeletonized and the component angles are extracted. The system then segments the sequence into actions using the detected breakpoints. Each action is then classified using the nearest neighbor classifier. The algorithm has been tested on 20 sequences of continuous activity. The test and training sequences are different, and 6 subjects have been used in this work. Each test sequence consists of actions performed in a continuous manner with no breaks or pauses. The 20 sequences contain, in all, 143 actions and 128 breakpoints. Table 1 gives the results that have been obtained on the test sequences. The results demonstrate the efficiency of the algorithm with respect to breakpoint detection and action recognition. Figure 13 contains frames 1 to 59 of test sequence 1; Table 2 gives the results for this sequence. When the breakpoint is not correctly identified, the probability of incorrect action classification is higher.
Number of:     Total   Correct   Efficiency (%)
Breakpoints    128     110       85.90
Actions        143     110       76.92
Walking        29      26        89.66
Sitting        13      10        76.92
Standing up    16      12        75.00
Bending        21      15        71.42
Getting up     19      14        73.68
Squatting      24      18        75.00
Rising         21      15        71.42

Table 1: Results for test sequences


7  Conclusion 

We have presented a methodology for automatic segmentation and recognition of continuous human activity. The activity sequence consists of actions that are performed in succession without any breaks or pauses. The sequence is segmented into individual action primitives. The system uses the angles subtended by three major components of the body to classify frames of the sequence into breakpoint and non-breakpoint frames. These components are the torso, the upper leg and the lower leg. The action between two breakpoint frames is then classified using a three-element feature vector. Although the system is limited to the lateral viewpoint of the body, we have tested it with sequences in which the camera does not view the subject from a perfect lateral view. Different angles of inclination with the plane of the camera have been tested. The system can tolerate a deviation of 40 degrees in $\theta_1$ and 25 degrees in $\theta_2$, as illustrated in Fig. 12. We have also tested on sequences in which $\theta_1$ and $\theta_2$ change during the execution of an activity. Figure 14 is a sequence in which the camera views the subject at angle $\theta_1$ = 40 degrees. The results of this sequence are in Table 3. The sequence has not been numbered due to page restrictions. The system starts to fail when the high points of curvature on the skeleton contour are no longer discernible in the field of view. False hip and knee regions also get generated in certain cases. We would like to find a more efficient method of detecting the hip and knee regions under these conditions. Furthermore, tracking the trajectory of other parts of the body along with the leg components could make the system more robust. This would help in recognizing more complex actions taken from different views.
Test Sequence 1

Actual frames   Actual action   Detected frames   Detected action
1-8             walking         1-8               walking
8-17            sitting         9-16              sitting
17-25           standing up     16-24             standing up
25-30           walking         25-30             walking
30-37           squatting       31-37             squatting
37-43           rising          38-45             rising
43-52           bending         45-51             bending
52-59           getting up      51-59             getting up

Table 2: Results for test sequence 1


Figure  14:  Frames  1  to  56  of  test  sequence  2 


Test Sequence 2

            Frames                   Actions
Actual      32-37, 47-56             bending, getting up, rising
Detected    26-31, 32-36, 47-56      bending, getting up, rising

Table 3: Results for test sequence 2 taken at $\theta_1$ = 40 degrees


